
Uninstall finds and kills any running elastic-agent watch process #3384

Merged
merged 25 commits into elastic:main from uninstall-kill-watcher
Sep 21, 2023

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented Sep 8, 2023

What does this PR do?

Finds any running elastic-agent watch process on the host and kills the running watcher.

Adds console output for uninstall so it's clear what uninstall is performing.
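The killing logic can be sketched roughly like this — a simplified, hypothetical Go illustration (the `watcherPIDs` parser below is an assumption for demonstration only; the PR's actual helper is `utils.GetWatcherPIDs()`, which inspects running processes directly rather than parsing `ps` output):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// watcherPIDs scans `ps`-style "PID COMMAND" output and returns the PIDs of
// any `elastic-agent watch` processes. This parser is a stand-in for the
// PR's real helper and exists only to illustrate the matching logic.
func watcherPIDs(psOutput string) []int {
	var pids []int
	for _, line := range strings.Split(psOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		// Match a command line of the form ".../elastic-agent watch ...".
		if strings.HasSuffix(fields[1], "elastic-agent") && fields[2] == "watch" {
			if pid, err := strconv.Atoi(fields[0]); err == nil {
				pids = append(pids, pid)
			}
		}
	}
	return pids
}

func main() {
	sample := `  PID COMMAND
76886 /Library/Elastic/Agent/elastic-agent
76924 /Library/Elastic/Agent/elastic-agent watch --path.config /Library/Elastic/Agent
76955 grep elastic-agent`
	for _, pid := range watcherPIDs(sample) {
		// A real uninstall would now signal the process and wait for it to
		// exit before removing the install directory.
		fmt.Printf("killing watcher pid %d\n", pid) // prints: killing watcher pid 76924
	}
}
```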

Why is it important?

Fixes the following problems:

  • An orphaned watcher process no longer remains after the Elastic Agent has been uninstalled.
  • Uninstall on Windows no longer fails if an upgrade has occurred within the past 10 minutes.
  • A sequence of uninstall/re-install within the 10-minute window no longer corrupts the Elastic Agent.
  • Adds the ability to track what uninstall is performing.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test (tested by integration tests already)

How to test this PR locally

  1. Install Elastic Agent 8.10.0-SNAPSHOT
  2. Upgrade to a build with the PR, this will get the watcher running.
  3. Run sudo elastic-agent uninstall to get it to fully uninstall.

Related issues

Logs

After Upgrade:

% ps aux | grep elastic-agent
root             76886   0.1  0.1 409760672  47904   ??  Ss   11:23AM   0:01.33 elastic-agent
root             76927   0.1  0.3 409548672 100480   ??  S    11:23AM   0:00.28 /Library/Elastic/Agent/data/elastic-agent-693ca9/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-693ca9/run/system/metrics-default
blake            76955   0.0  0.0 408626944   1232 s003  R+   11:24AM   0:00.00 grep elastic-agent
root             76930   0.0  0.3 409531104  98992   ??  S    11:23AM   0:00.21 /Library/Elastic/Agent/data/elastic-agent-693ca9/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-693ca9/run/http/metrics-monitoring
root             76929   0.0  0.3 409525744  96864   ??  S    11:23AM   0:00.21 /Library/Elastic/Agent/data/elastic-agent-693ca9/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-693ca9/run/beat/metrics-monitoring
root             76928   0.0  0.3 409520032  95904   ??  S    11:23AM   0:00.26 /Library/Elastic/Agent/data/elastic-agent-693ca9/components/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///Library/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock -E path.data=/Library/Elastic/Agent/data/elastic-agent-693ca9/run/filestream-monitoring
root             76924   0.0  0.1 409355680  32224   ??  Ss   11:23AM   0:00.06 /Library/Elastic/Agent/elastic-agent watch --path.config /Library/Elastic/Agent --path.home /Library/Elastic/Agent

After Uninstall:

% ps aux | grep elastic-agent
blake            77071   0.0  0.0 408626944   1280 s003  S+   11:24AM   0:00.00 grep elastic-agent

Uninstall output:

Elastic Agent will be uninstalled from your system at /Library/Elastic/Agent. Do you want to continue? [Y/n]:Y
Stopping service... DONE
Stopping upgrade watcher; none found... DONE
Removing service... DONE
Removing install directory.... DONE
Elastic Agent has been uninstalled.

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team backport-skip labels Sep 8, 2023
@blakerouse blakerouse self-assigned this Sep 8, 2023
@blakerouse blakerouse requested a review from a team as a code owner September 8, 2023 15:32
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Contributor

elasticmachine commented Sep 8, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-09-21T13:06:23.722+0000

  • Duration: 30 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 6313
Skipped 59
Total 6372

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Contributor

elasticmachine commented Sep 8, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.78% (81/82) 👍
Files 66.102% (195/295) 👎 -0.225
Classes 65.693% (360/548) 👎 -0.057
Methods 52.769% (1134/2149) 👍 0.083
Lines 38.195% (12860/33669) 👎 -0.031
Conditionals 100.0% (0/0) 💚

@cmacknz
Member

cmacknz commented Sep 8, 2023

You need to revert 94764be as part of this to see whether it actually fixes the tests.

Member

@pchila pchila left a comment

Code looks good, maybe needs some testing.
For the testing part, I guess we need to create an e2e test that performs an upgrade so that we have the watcher in play, correct?

@ycombinator
Contributor

Should we also cleanup the Upgrade Marker file when we kill the Upgrade Watcher process? I think it would be good to try and maintain the invariant that the Upgrade Watcher is expected to be running if there's an Upgrade Marker file present.

@blakerouse
Contributor Author

Should we also cleanup the Upgrade Marker file when we kill the Upgrade Watcher process? I think it would be good to try and maintain the invariant that the Upgrade Watcher is expected to be running if there's an Upgrade Marker file present.

This is only called during Uninstall, which removes everything anyway; do we need to specifically remove that file when killing the watcher?

@blakerouse
Contributor Author

I am seeing this branch basically get all tests passing on Windows. I am seeing one issue with uninstall where it seems that the Elastic Agent is still running. I think it's possible that on Windows stopping the service doesn't mean it fully gets stopped (which I believe we have another reported bug about). But I am not getting any issues with the watcher writing to the log file anymore.

@ycombinator
Contributor

This is only called for Uninstall that is removing everything anyway, do we need to specifically remove that file when killing the watcher?

True, we don't need to worry about cleaning up the Upgrade Marker file specifically because everything will get cleaned up during Uninstall anyway.

I'm working on #2706 and I plan to reuse the work you're doing here over there so I was thinking ahead to that. Don't worry about changing anything in this PR here. I'll add the cleanup if it makes sense when I work on #2706.

@cmacknz
Member

cmacknz commented Sep 11, 2023

You should be able to revert bf467e3 on this branch as well to confirm that test passes with this fix.

kind: bug-fix

# Change summary; a 80ish characters long description of the change.
summary: Uninstall finds and kills any running watcher process
Member

I guess it'd be good to mention the uninstall tracking as well.

Comment on lines +49 to +71
// Succeeded step is done and successful.
func (pts *progressTrackerStep) Succeeded() {
	prefix := " "
	if pts.substeps {
		prefix = pts.prefix + " "
	}
	if !pts.rootstep {
		pts.tracker.printf("%sDONE\n", prefix)
	}
	pts.finalizeFunc()
}

// Failed step has failed.
func (pts *progressTrackerStep) Failed() {
	prefix := " "
	if pts.substeps {
		prefix = pts.prefix + " "
	}
	if !pts.rootstep {
		pts.tracker.printf("%sFAILED\n", prefix)
	}
	pts.finalizeFunc()
}
Member

[Suggestion]
You could make a function that takes a string (DONE or FAILED) and have Succeeded and Failed call it with the right string.
But the chance someone will need to change either Succeeded or Failed and forget the other is small, so it's really up to you.
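The suggested refactor could look roughly like this sketch, using a simplified stand-in for the tracker (the `finish` helper name and the minimal `progressTracker` type are assumptions for illustration, not the PR's code):

```go
package main

import (
	"bytes"
	"fmt"
)

// progressTracker is a simplified stand-in for the PR's tracker; it just
// collects printed output in a buffer.
type progressTracker struct{ out bytes.Buffer }

func (t *progressTracker) printf(format string, a ...any) {
	fmt.Fprintf(&t.out, format, a...)
}

type progressTrackerStep struct {
	tracker      *progressTracker
	prefix       string
	substeps     bool
	rootstep     bool
	finalizeFunc func()
}

// finish factors out the logic shared by Succeeded and Failed, so a change
// to the prefix handling can never apply to one but not the other.
func (pts *progressTrackerStep) finish(status string) {
	prefix := " "
	if pts.substeps {
		prefix = pts.prefix + " "
	}
	if !pts.rootstep {
		pts.tracker.printf("%s%s\n", prefix, status)
	}
	pts.finalizeFunc()
}

// Succeeded step is done and successful.
func (pts *progressTrackerStep) Succeeded() { pts.finish("DONE") }

// Failed step has failed.
func (pts *progressTrackerStep) Failed() { pts.finish("FAILED") }

func main() {
	t := &progressTracker{}
	step := &progressTrackerStep{tracker: t, finalizeFunc: func() {}}
	step.Succeeded()
	fmt.Print(t.out.String()) // prints " DONE"
}
```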

@blakerouse blakerouse enabled auto-merge (squash) September 19, 2023 15:48
@blakerouse
Contributor Author

stack never became ready, I will try again

@blakerouse
Contributor Author

buildkite test this

1 similar comment
@blakerouse
Contributor Author

buildkite test this

@ycombinator
Contributor

ycombinator commented Sep 20, 2023

Tested the progress checker changes in this PR. I like how this looks!

$ sudo ./elastic-agent install
Elastic Agent will be installed at /Library/Elastic/Agent and will run as a service. Do you want to continue? [Y/n]:y
Do you want to enroll this Agent into Fleet? [Y/n]:n
Uninstalling current Elastic Agent...
   Stopping service... DONE
   Stopping upgrade watcher; none found... DONE
   Removing service... FAILED
   Removing install directory... DONE
   DONE
Copying files.................................................................... DONE
Installing service... DONE
Starting service... DONE
Elastic Agent has been successfully installed.

I have two concerns:

  • I don't think we should show the step about the upgrade watcher to the user. I think of the upgrade watcher as an implementation detail — something the Agent runs internally to monitor its upgrade. So my preference would be to take out that step from the progress output.
  • I don't know how the nested steps would generalize for steps that take long. Specifically, the ...... could wrap around and "break" the indentation for a nested step that takes long. But I'm okay deferring this concern for another day; the only nested steps right now should all complete pretty quickly.

@blakerouse
Contributor Author

I have two concerns:

  • I don't think we should show the step about the upgrade watcher to the user. I think of the upgrade watcher as an implementation detail — something the Agent runs internally to monitor its upgrade. So my preference would be to take out that step from the progress output.

@cmacknz specifically asked for this information to be added, so it's clear that a watcher was running and what PID was killed. I think overall it's good to see, even if it's an implementation detail.

  • I don't know how the nested steps would generalize for steps that take long. Specifically, the ...... could wrap around and "break" the indentation for a nested step that takes long. But I'm okay deferring this concern for another day; the only nested steps right now should all complete pretty quickly.

Yeah, I think we would need to switch from .... to something like a spinner to ensure that it doesn't wrap. Otherwise we'd have to track the terminal width, which to me would be more extra code than needed.
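A spinner along those lines could be sketched as follows (hypothetical; the PR keeps the dots, and this single-character redraw is only the idea under discussion):

```go
package main

import "fmt"

// spinnerFrame returns the i-th frame of a fixed-width spinner. Because the
// frame is a single character redrawn in place with `\r`, a long-running
// step never grows past the terminal width the way a trail of dots does.
func spinnerFrame(i int) string {
	frames := []string{"|", "/", "-", "\\"}
	return frames[i%len(frames)]
}

func main() {
	// Redraw the same line instead of appending dots:
	for i := 0; i < 8; i++ {
		fmt.Printf("\rCopying files... %s", spinnerFrame(i))
	}
	fmt.Printf("\rCopying files... DONE\n")
}
```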

Contributor

@ycombinator ycombinator left a comment

Progress tracker related changes LGTM!

@mergify
Contributor

mergify bot commented Sep 20, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b uninstall-kill-watcher upstream/uninstall-kill-watcher
git merge upstream/main
git push upstream uninstall-kill-watcher

@blakerouse
Contributor Author

seems the stack was either removed mid test or was never really ready, trying again...

@blakerouse
Contributor Author

buildkite test this

@elastic-sonarqube
Copy link

SonarQube Quality Gate

Quality Gate failed

Failed condition: 28.5% Coverage on New Code (is less than 40%)

See analysis details on SonarQube

@pierrehilbert pierrehilbert merged commit 7e86d24 into elastic:main Sep 21, 2023
10 of 11 checks passed
@ycombinator
Contributor

@blakerouse Could we backport this PR to 8.10, since it's fixing a bug (#3371)? Also, this PR introduces a utils.GetWatcherPIDs() function which is needed in another 8.10 backport PR (#3488).

@blakerouse blakerouse added backport-v8.10.0 Automated backport with mergify and removed backport-skip labels Oct 2, 2023
mergify bot pushed a commit that referenced this pull request Oct 2, 2023
…3384)

* kill watcher on uninstall

* Empty commit.

* Fix killWatcher.

* Empty commit.

* Another fix for killWatcher.

* Empty commit.

* Catch ErrProcessDone.

* Empty commit.

* Empty commit

* Add changelog fragment.

* Make it work on Windows.

* Change killWatcher to be in a loop.

* Add loop to killWatcher.

* Revert "Skip TestStandaloneUpgradeFailsStatus to fix failing integration tests again (#3391)"

This reverts commit bf467e3.

* Revert "Fix integration tests by waiting for the watcher to finish during upgrade tests (#3370)"

This reverts commit 94764be.

* Fix test.

* Revert "Revert "Skip TestStandaloneUpgradeFailsStatus to fix failing integration tests again (#3391)""

This reverts commit 3b0f040.

* Add progress tracking for uninstall like install.

* Log when no watchers our found.

* Improve uninstall.

* Fix data race.

(cherry picked from commit 7e86d24)

# Conflicts:
#	internal/pkg/agent/cmd/install.go
#	internal/pkg/agent/install/install.go
#	internal/pkg/agent/install/progress.go
#	internal/pkg/agent/install/progress_test.go
@blakerouse
Contributor Author

@ycombinator Yes, backporting it now.

@blakerouse
Contributor Author

@ycombinator Actually this cannot be backported without backporting the work you did to add the installation output in 8.11, which this work is based on. We can either backport that one as well, not backport this one, or only backport utils.GetWatcherPIDs() into the upgrade tests refactor backport. Thoughts?

@cmacknz your opinion as well?

@ycombinator
Contributor

Only backport the utils.GetWatcherPIDs() into the upgrade tests refactor backport.

I'm in favor of this as the installation output work is an enhancement (not a bugfix).

@blakerouse blakerouse deleted the uninstall-kill-watcher branch October 3, 2023 15:12
@blakerouse
Contributor Author

Only backport the utils.GetWatcherPIDs() into the upgrade tests refactor backport.

I'm in favor of this as the installation output work is an enhancement (not a bugfix).

Agreed. I will just add the utils.GetWatcherPIDs() to the backport.

Labels
backport-v8.10.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Uninstall does not stop a running watcher after upgrade
7 participants